Tight and Simple Web Graph Compression

Authors

  • Szymon Grabowski
  • Wojciech Bieniecki
Abstract

Analysing Web graphs has applications in determining page ranks, fighting Web spam, detecting communities and mirror sites, and more. Such analysis is, however, hampered by the need to store a major part of huge graphs in external memory, which prevents efficient random access to edge (hyperlink) lists. A number of algorithms involving compression techniques have thus been presented to represent Web graphs succinctly while still providing random access. Those techniques are usually based on differential encodings of the adjacency lists, finding repeating nodes or node regions in successive lists, more general grammar-based transformations, or 2-dimensional representations of the binary matrix of the graph. In this paper we present a Web graph compression algorithm which can be seen as an engineered variant of the Boldi and Vigna (2004) method. We extend the notion of similarity between link lists and use a more compact encoding of residuals. The algorithm works on blocks of varying size (in the number of input lines) and sacrifices access time for a better compression ratio, achieving a more succinct graph representation than other algorithms reported in the literature. Additionally, we show a simple idea for 2-dimensional graph representation which also achieves a state-of-the-art compression ratio.
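
The two ideas the abstract builds on, differential encoding of adjacency lists and reuse of a similar earlier list with leftover "residuals", can be illustrated with a minimal Python sketch. This is not the authors' algorithm or the exact Boldi and Vigna format: the function names, the use of plain Python lists instead of a bit-level code, and the choice of reference list are all assumptions made for illustration only.

```python
def gap_encode(neighbors):
    """Encode a sorted list of node ids as the first id followed by gaps."""
    gaps = []
    prev = 0
    for i, v in enumerate(neighbors):
        gaps.append(v if i == 0 else v - prev)
        prev = v
    return gaps


def gap_decode(gaps):
    """Invert gap_encode by accumulating the gaps."""
    out = []
    total = 0
    for i, g in enumerate(gaps):
        total = g if i == 0 else total + g
        out.append(total)
    return out


def encode_with_reference(current, reference):
    """Describe `current` relative to an earlier list `reference`.

    Returns one copy flag per reference entry (1 = also in `current`)
    and the gap-encoded residuals (entries absent from the reference).
    Hypothetical layout, chosen for readability rather than compactness.
    """
    cur = set(current)
    ref = set(reference)
    copy_flags = [1 if v in cur else 0 for v in reference]
    residuals = [v for v in current if v not in ref]
    return copy_flags, gap_encode(residuals)


def decode_with_reference(copy_flags, residual_gaps, reference):
    """Rebuild the sorted adjacency list from flags, residuals and reference."""
    copied = [v for v, flag in zip(reference, copy_flags) if flag]
    return sorted(copied + gap_decode(residual_gaps))


reference = [10, 11, 15, 20]     # adjacency list of an earlier node
current = [10, 15, 16, 20, 31]   # adjacency list to be encoded
flags, residual_gaps = encode_with_reference(current, reference)
restored = decode_with_reference(flags, residual_gaps, reference)
```

In a real compressor the flags and gaps would be written with a variable-length integer code; the sketch only shows why similar neighbouring lists leave few, small numbers to store.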

Similar resources

Tight and simple Web graph compression for forward and reverse neighbor queries

Analyzing Web graphs has applications in determining page ranks, fighting Web spam, detecting communities and mirror sites, and more. This study is however hampered by the necessity of storing a major part of huge graphs in the external memory, which prevents efficient random access to edge (hyperlink) lists. A number of algorithms involving compression techniques have thus been presented, to repr...

Buckling and failure characteristics of slender web I-column girders under interactive compression and shear

Geometric and material nonlinear behavior of slender webs in I-column girders having stocky flanges under the action of combined lateral and axial loads is investigated. Interaction curves corresponding to the application of compressive and shear loads at buckling and ultimate stages for both web plates and column sections are plotted. In addition, the effects of flange and web slenderness rati...

A Simple Algorithm for Compressing Web-like Graphs Efficiently

We introduce an efficient compression algorithm for web-like graphs that exploits the graph’s structure to achieve better compression rate. In particular, we make use of the locality of reference in the graph, the node similarity and the power law distribution of its nodes’ degrees, three properties usually observed in large sparse graphs that model networks created by human activity. Furthermo...

Web-graph Pre-compression for Similarity Based Algorithms

The size of the web-graph created from crawling the web is an issue for every search engine. The amount of data gathered by web crawlers makes it impossible to load the web-graph into memory, which increases the number of I/O operations. Compression can help reduce the run-time of web-graph processing algorithms. We introduce a new algorithm for compressing the link structure of the web graph by groupin...

Finding Community Base on Web Graph Clustering

Search Pointers organize the main part of the application on the Internet. However, because of information management hardware, the high volume of data, and word similarities across different fields, most answers to users' questions aren't correct. So web graph clustering and cluster placement in corresponding answers help users achieve their intended results. Community (web communit...

Publication date: 2010